
Regression Model Builder

Description

The Regression Model Builder step builds a regression model from training data.

Configurations

1. Step Name: Specify the name of the step. Step names should be unique within a workflow.
2. Number of Rows to Process: Select how the rows are processed. Available options are:
- All
- Batch
This governs whether all the rows of the dataset are passed in one shot or in batches. If you are building a model on a very large dataset, Batch row processing is typically the better choice.
3. Size: Select the batch size of the dataset. For example, if your dataset has 50,000 rows, 1,000 can be a good batch size.
Note: You can specify a batch size only if you have selected Batch in the Number of Rows to Process field.
4. Build Using AE Model Version: Specify the AE model version. Available options are Version 1.0 [Python 3.6] and Version 2.0 [Python 3.8].
5. File Name: Specify the name of the file that contains the model.
6. Algorithm: Select the algorithm to build the model. Available algorithms are:
- Linear Regression
- Random Forest Regression
- Support Vector Regression
7. Tuning Algorithms: Select the hyperparameter tuning algorithm.
Note: Currently, only Grid Search is supported.
8. Algorithm Parameters*: Provide or select the algorithm parameters.
Note: The parameters available in the Algorithm Parameters field depend on the selected algorithm. For more details, see Algorithms.
Field Mapping Tab
1. Name: Specify the name of the input field that needs to be passed for model building.
2. Incoming Type: Specify the data type of the field. The data type can be either String or Number.
3. Text Preprocessing: The algorithms work on vectors of numbers, so fields of type String need to be converted internally to numeric vectors; this cell lets you specify the text processing attributes for such a field. The cell can be clicked only for fields with the String data type. The dialog that opens has two tabs.
- The first tab lets you specify one or more text processing options (a sketch of what these options do appears after the Field Mapping Tab fields):
  - Remove Punctuation: Removes standard punctuation marks from the text.
  - Remove Stop Words: Removes stop words such as 'the', 'as', 'in', and so on.
  - Additional Stop Words: Choose a plain text file in which every additional stop word is on a separate line. These are your domain-specific stop words.
  - Lemmatization: Converts words to their base form, for example, mice to mouse and houses to house.
  - Stemming: Reduces a word to its stem regardless of the word form used in the text. For example, going, went, and goes are converted to go.
- The second tab lets you test your text processing options. Type any text in the box next to 'Value:' and click the 'Test' button; the box next to 'Result:' shows the text with your selected text processing options applied.
4. Get Fields: Click to get the fields from the previous step.
5. Class/Target Field: Specify the target field for the regression.
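The step applies these operations internally; the sketch below only illustrates what the Remove Punctuation, Remove Stop Words, Lemmatization, and Stemming options do, using Python's NLTK library as a stand-in (an assumption; the plugin's own implementation is not documented here).

```python
# Illustration only: what the text preprocessing options do.
# NLTK is used here as a stand-in; the step's internal implementation may differ.
import string
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, PorterStemmer
from nltk.tokenize import word_tokenize
# Requires: nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')

text = "The mice were going into the houses, quickly!"

# Remove Punctuation
no_punct = text.translate(str.maketrans("", "", string.punctuation))

# Remove Stop Words, plus one made-up "Additional Stop Words" entry ('quickly')
stop_words = set(stopwords.words("english")) | {"quickly"}
tokens = [t for t in word_tokenize(no_punct.lower()) if t not in stop_words]

# Lemmatization: mice -> mouse, houses -> house
lemmas = [WordNetLemmatizer().lemmatize(t) for t in tokens]

# Stemming: going -> go (stems are not always dictionary words)
stems = [PorterStemmer().stem(t) for t in tokens]

print(tokens)  # ['mice', 'going', 'houses']
print(lemmas)  # ['mouse', 'going', 'house']
print(stems)   # ['mice', 'go', 'hous']
```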

When a feature of type String is processed, as described in the Text Preprocessing row of the table above, it needs to be converted into numeric features. The Text Vectorization Tab governs how all String features are converted into numeric features. An n-gram is a contiguous sequence of n items from a given sample of text or speech. The table below shows how a string is tokenized internally for different n-gram settings.

String: "Weather today is good"
- N-Gram Start/End 1-1: 'Weather', 'today', 'good'
- N-Gram Start/End 1-2: 'Weather', 'today', 'good', 'Weather today', 'today good'
- N-Gram Start/End 1-3: 'Weather', 'today', 'good', 'Weather today', 'today good', 'Weather today good'
- N-Gram Start/End 2-3: 'Weather today', 'today good', 'Weather today good'

Note: The word 'is' is treated as a stop word and is not considered.
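The tokenization above can be reproduced with scikit-learn's CountVectorizer, whose ngram_range argument plays the role of the N Gram Start/End settings (that the step uses scikit-learn in exactly this way is an assumption; the mapping is illustrative).

```python
from sklearn.feature_extraction.text import CountVectorizer

text = ["Weather today is good"]

# Each ngram_range pair corresponds to an N Gram Start / N Gram End setting.
# stop_words='english' drops 'is', as in the table above; tokens are lowercased by default.
for ngram_range in [(1, 1), (1, 2), (1, 3), (2, 3)]:
    vectorizer = CountVectorizer(ngram_range=ngram_range, stop_words="english")
    vectorizer.fit(text)
    print(ngram_range, list(vectorizer.get_feature_names_out()))
# (1, 2) -> ['good', 'today', 'today good', 'weather', 'weather today']
```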

Text Vectorization Tab
1. N Gram Start: Specify a numeric value with a minimum of 1.
2. N Gram End: Specify a numeric value greater than or equal to N Gram Start.
3. Vectorization: The n-gram operation tokenizes the input String feature. Vectorization is the operation in which these tokens are converted into the numeric features that the algorithms need. Three types of vectorizers are supported:
- Count Vectorizer: Counts the number of times a token shows up in the document and uses this value as its weight.
- Tfidf Vectorizer: TF-IDF stands for "term frequency-inverse document frequency"; the weight assigned to each token depends not only on its frequency in a document but also on how frequent that term is across the entire corpus.
- Hashing Vectorizer: Designed to be as memory-efficient as possible. Instead of storing the tokens as strings, the vectorizer applies the hashing trick to encode them as numeric indexes. The downside of this method is that, once vectorized, the feature names can no longer be retrieved.
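As a rough sketch of how the three vectorizer types differ, using the scikit-learn classes of the same names (that the step uses these exact classes is an assumption):

```python
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfVectorizer, HashingVectorizer)

docs = ["weather today is good", "weather tomorrow is not good"]

# Count Vectorizer: raw token counts are used as weights.
counts = CountVectorizer().fit_transform(docs)

# Tfidf Vectorizer: counts are reweighted by inverse document frequency, so tokens
# that occur in every document (e.g. 'weather') receive lower weights.
tfidf = TfidfVectorizer().fit_transform(docs)

# Hashing Vectorizer: tokens are hashed to column indexes; memory efficient,
# but the feature names cannot be recovered afterwards.
hashed = HashingVectorizer(n_features=2**8).fit_transform(docs)

# All three produce sparse numeric matrices that the regression algorithms can consume.
print(counts.shape, tfidf.shape, hashed.shape)
```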
Evaluation Tab
1. Evaluation Type: Choose an evaluation algorithm type from the drop-down list.
- None: Choose None if evaluation is not needed.
- Train/Test Split: Splits the data into training and test sets as per the parameters specified below (a sketch appears after the Evaluation Tab fields). The training set contains a known output, and the model learns on this data so that it can generalize to other data later. The test set (or subset) is used to test the model's predictions.
2. Test Percentage: For Train/Test Split.
Allowed types: float, int, or None, optional (default=None).
- If float, it should be between 0.0 and 1.0 and represents the proportion of the dataset to include in the test split.
- If int, it represents the absolute number of test samples.
- If None, it is set to 0.25.
3. Random State: For Train/Test Split.
Allowed types: int, RandomState instance, or None, optional (default=None).
- If int, random_state is the seed used by the random number generator.
- If a RandomState instance, random_state is the random number generator.
- If None, the random number generator is the RandomState instance used by np.random.
4. Evaluation Output File Name: Absolute path of the HTML report output file.
5. Add Output Filename to Result: Enable the checkbox to display a downloadable link to the HTML report output file on the AE portal.
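The Test Percentage and Random State descriptions above mirror the test_size and random_state arguments of scikit-learn's train_test_split; the following is a minimal sketch of a Train/Test Split evaluation (the data and model choice are illustrative only).

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error

rng = np.random.RandomState(0)
X = rng.rand(100, 3)
y = X @ np.array([1.5, -2.0, 0.5]) + 0.1 * rng.randn(100)

# test_size=0.25 mirrors the default Test Percentage (None -> 0.25);
# random_state=42 mirrors an integer Random State, i.e. a reproducible seed.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)

model = LinearRegression().fit(X_train, y_train)   # learn on the training split
y_pred = model.predict(X_test)                     # predict on the held-out test split
print(r2_score(y_test, y_pred), mean_squared_error(y_test, y_pred))
```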

Algorithms

The following describes each algorithm along with its corresponding parameters.

1. Linear Regression
LinearRegression fits a linear model with coefficients w = (w1, ..., wp) to minimize the residual sum of squares between the observed targets in the dataset and the targets predicted by the linear approximation.
Algorithm Parameters: None.
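A minimal scikit-learn sketch (the fitted coefficients w are exposed as coef_; the data is illustrative):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.1, 3.9, 6.2, 8.1])            # roughly y = 2x

model = LinearRegression().fit(X, y)
print(model.coef_, model.intercept_)          # the fitted w and intercept
print(model.predict([[5.0]]))                 # prediction from the linear approximation
```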
2. Random Forest Regression
A random forest is a meta estimator that fits a number of decision tree regressors on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. The sub-sample size is controlled with the max_samples parameter if bootstrap=True (default); otherwise, the whole dataset is used to build each tree.
Algorithm Parameters:
n_estimators: int, default=100
The number of trees in the forest.
criterion: {"squared_error", "absolute_error", "poisson"}, default="squared_error"
The function to measure the quality of a split. Supported criteria are "squared_error" for the mean squared error, which is equal to variance reduction as feature selection criterion, "absolute_error" for the mean absolute error, and "poisson", which uses reduction in Poisson deviance to find splits. Training using "absolute_error" is significantly slower than when using "squared_error".
max_depth: int, default=None
The maximum depth of the tree. If None, then nodes are expanded until all leaves are pure or until all leaves contain less than min_samples_split samples.
min_samples_split: int or float, default=2
The minimum number of samples required to split an internal node:
- If int, then consider min_samples_split as the minimum number.
- If float, then min_samples_split is a fraction and ceil(min_samples_split * n_samples) are the minimum number of samples for each split.
min_samples_leaf: int or float, default=1
The minimum number of samples required to be at a leaf node. A split point at any depth will only be considered if it leaves at least min_samples_leaf training samples in each of the left and right branches. This may have the effect of smoothing the model, especially in regression.
- If int, then consider min_samples_leaf as the minimum number.
- If float, then min_samples_leaf is a fraction and ceil(min_samples_leaf * n_samples) are the minimum number of samples for each node.
min_weight_fraction_leaf: float, default=0.0
The minimum weighted fraction of the sum total of weights (of all the input samples) required to be at a leaf node. Samples have equal weight when sample_weight is not provided.
max_features: {“auto”, “sqrt”, “log2”}, int or float, default=”auto”
The number of features to consider when looking for the best split:
- If “auto”, then max_features=n_features.
- If “sqrt”, then max_features=sqrt(n_features).
- If “log2”, then max_features=log2(n_features).
- If None, then max_features=n_features.
Note: The search for a split does not stop until at least one valid partition of the node samples is found, even if it requires effectively inspecting more than max_features features.
max_leaf_nodes: int, default=None
Grow trees with max_leaf_nodes in best-first fashion. Best nodes are defined as relative reduction in impurity. If None then unlimited number of leaf nodes.
min_impurity_decrease: float, default=0.0
A node will be split if this split induces a decrease of the impurity greater than or equal to this value.
The weighted impurity decrease equation is the following:
N_t / N * (impurity - N_t_R / N_t * right_impurity - N_t_L / N_t * left_impurity)
where N is the total number of samples, N_t is the number of samples at the current node, N_t_L is the number of samples in the left child, and N_t_R is the number of samples in the right child.
N, N_t, N_t_R and N_t_L all refer to the weighted sum, if sample_weight is passed.
bootstrap: bool, default=True
Whether bootstrap samples are used when building trees. If False, the whole dataset is used to build each tree.
oob_score: bool, default=False
Whether to use out-of-bag samples to estimate the generalization score. Only available if bootstrap=True.
n_jobs: int, default=None
The number of jobs to run in parallel. fit, predict, decision_path and apply are all parallelized over the trees. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
ccp_alpha: non-negative float, default=0.0
Complexity parameter used for Minimal Cost-Complexity Pruning. The subtree with the largest cost complexity that is smaller than ccp_alpha will be chosen. By default, no pruning is performed. See Minimal Cost-Complexity Pruning for details.
max_samples: int or float, default=None
If bootstrap is True, the number of samples to draw from X to train each base estimator.
- If None (default), then draw X.shape[0] samples.
- If int, then draw max_samples samples.
- If float, then draw max_samples * X.shape[0] samples. Thus, max_samples should be in the interval (0.0, 1.0).
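The parameters above mirror scikit-learn's RandomForestRegressor; the sketch below shows how a few of them fit together (the values are illustrative, not recommendations, and assume a scikit-learn 1.x backend).

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.datasets import make_regression

X, y = make_regression(n_samples=500, n_features=8, noise=0.2, random_state=0)

model = RandomForestRegressor(
    n_estimators=200,            # more trees than the default 100
    criterion="squared_error",   # split quality measured by mean squared error
    max_depth=None,              # grow until leaves are pure or too small to split
    min_samples_split=2,
    min_samples_leaf=1,
    max_features="sqrt",         # consider sqrt(n_features) features per split
    bootstrap=True,              # each tree is trained on a bootstrap sample
    max_samples=0.8,             # ... drawn from 80% of the rows (only with bootstrap=True)
    oob_score=True,              # out-of-bag generalization estimate (needs bootstrap=True)
    n_jobs=-1,                   # parallelize over all processors
)
model.fit(X, y)
print(model.oob_score_)          # out-of-bag R^2 estimate
```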
3. Support Vector Regression
The implementation is based on libsvm. The fit time complexity is more than quadratic in the number of samples, which makes it hard to scale to datasets with more than a couple of tens of thousands of samples. For large datasets, consider using LinearSVR or SGDRegressor instead, possibly after a Nystroem transformer.
Algorithm Parameters:
kernel: {'linear', 'poly', 'rbf', 'sigmoid', 'precomputed'} or callable, default='rbf'
Specifies the kernel type to be used in the algorithm. If none is given, 'rbf' is used. If a callable is given, it is used to precompute the kernel matrix.
degree: int, default=3
Degree of the polynomial kernel function ('poly'). Ignored by all other kernels.
gamma: {'scale', 'auto'} or float, default='scale'
Kernel coefficient for 'rbf', 'poly' and 'sigmoid'.
- If gamma='scale' (default), 1 / (n_features * X.var()) is used as the value of gamma.
- If 'auto', 1 / n_features is used.
coef0: float, default=0.0
Independent term in the kernel function. It is only significant in 'poly' and 'sigmoid'.
C: float, default=1.0
Regularization parameter. The strength of the regularization is inversely proportional to C. Must be strictly positive. The penalty is a squared l2 penalty.
epsilon: float, default=0.1
Epsilon in the epsilon-SVR model. It specifies the epsilon-tube within which no penalty is associated in the training loss function with points predicted within a distance epsilon from the actual value.
shrinking: bool, default=True
Whether to use the shrinking heuristic. See the User Guide.
max_iter: int, default=-1
Hard limit on iterations within solver, or -1 for no limit.
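These parameters mirror scikit-learn's SVR; the following is a minimal sketch with illustrative values (the scaling step is a common companion to SVR, not something the parameters above mandate).

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.uniform(-3, 3, size=(200, 1))
y = np.sin(X).ravel() + 0.1 * rng.randn(200)

# SVR is sensitive to feature scale, so it is commonly wrapped with a scaler.
model = make_pipeline(
    StandardScaler(),
    SVR(kernel="rbf",     # default kernel
        C=1.0,            # regularization strength is inversely proportional to C
        epsilon=0.1,      # width of the no-penalty tube around the prediction
        gamma="scale"),   # 1 / (n_features * X.var())
)
model.fit(X, y)
print(model.predict([[0.5]]))   # should be close to sin(0.5) ~= 0.48
```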

Limitations:

You may get a value conversion error when the number of fields in the Microsoft Excel Input step differs from the fields that need to be passed to the ML: Model Builder step. The error occurs because of incorrect data type conversion in the Microsoft Excel Input step. The workaround is one of the following:
- Ensure that the Microsoft Excel Input step has the same fields as those required by the ML: Model Builder step.
- Ensure that the data type of all fields is String.